Abstract 

Most web pages are linked to others with related content. 
This idea, combined with another that says that text in, 
and possibly around, HTML anchors describes the pages 
to which they point, is the foundation for a usable World-Wide 
Web. In this paper, we examine to what extent these 
ideas hold by empirically testing whether topical locality 
mirrors spatial locality of pages on the Web. In particular, 
we find that the likelihood of linked pages having 
similar textual content is high; the similarity of sibling 
pages increases when the links from the parent are close 
together; titles, descriptions, and anchor text represent at 
least part of the target page; and that anchor text may be 
a useful discriminator among unseen child pages. These 
results show the foundations necessary for the success 
of many web systems, including search engines, focused 
crawlers, linkage analyzers, and intelligent web agents. 

1 Introduction 

Most web pages are linked to others with related content. 
This idea, combined with another that says that text in, 
and possibly around, HTML anchors describes the pages 
to which they point, is the foundation for a usable World-Wide 
Web. Together they make browsing possible, since users 
would not follow links if those links were unlikely to point 
to relevant and useful content. These ideas have also been 
noticed by researchers and developers, and are implicit in 
many of the systems and services found on the Web today. 
These ideas are so basic that in many cases they are not 
mentioned, even though without them the systems would 
fail to be useful. When one or both are mentioned explic- 
itly (as in [26, 16, 17, 3, 20, 8, 10, 2]), their influence is 
measured implicitly, if at all. This paper is an attempt to 
rectify the situation — we wish to measure the extent to 
which these ideas hold. 

This paper primarily addresses two topics: it examines 
the presence of textual overlap in pages near one another 
in the web, and the related issue of the quality of descrip- 
tions of web pages. The former is most relevant to fo- 
cused web crawlers and to search engines using link anal- 
ysis, while the latter is primarily of use to web indexers, 
meta-search tools, and to human browsers of the web since 
users expect to find pages that are indeed described by link 
text (when browsing the Web) and to find pages that are 
described accurately by the descriptive text presented by 
search engine results. We show empirical evidence of topical 
locality in the Web, and of the value of descriptive 
text as a representative of the targeted page. In particular, 
we find that the likelihood of linked pages having similar 
textual content is high; that the similarity of sibling 
pages increases when the links from the parent are close 
together; that titles, descriptions, and anchor text represent 
at least part of the target page; and that anchor text may be 
a useful discriminator among unseen child pages. 

For the experiments described in this paper, we select 
a set of pages from the Web and follow a random subset 
of the links present on those pages. This provides us with 
a corpus in which we can measure the textual similarity 
of nearby or remote pages and explore the quality of ti- 
tles, descriptions, and anchor links with respect to their 
representation of the document so described. In the next 
section, we will describe the motivation of this work in 
further detail, giving examples from many applications, 
including web indexers, search ranking systems, focused 
crawlers and web prefetchers. We will then describe our 
experimental methodology, present the results found, and 
conclude with a summary of our findings. 

2 Motivation 

The World-Wide Web is not a homogeneous, strictly- 
organized structure. While small parts of it may be or- 
dered systematically, many pages have links to others that 
appear almost random at first glance. Fortunately, fur- 
ther inspection generally shows that the typical web page 
author does not place random links in her pages (with 
the possible exception of banner advertising), but instead 
tends to create links to pages on related topics. This prac- 
tice is widely believed to be typical, and as such under- 
lies a number of systems and services on the web, some 
of which are described below. 

Additionally, there is the question of describing the web 
pages. While it is common for some applications to just 
use the contents of the web pages themselves, there are 
situations in which one may have only the titles and/or de- 
scriptions of a page (as in the results page from a query of a 
typical search engine), or only the text in and around a link 
to a page. A number of systems could or do assume that 
these "page proxies" accurately represent the pages they 
describe, and we include some of those systems below. 

2.1 Web indexers 

A web indexer takes pages from the web and generates an 
inverted index of those pages for later searching. Popular 
search engines including AltaVista 1, Lycos 2, etc. all have 
indexers of some sort that perform this function. How- 
ever, many search engines once indexed much less than 
the full text of each page. The WWW Worm [25], for ex- 
ample, indexed titles and anchor text. Lycos, at one time, 
only indexed the first 20 lines or 20% of the text [21]. 
More recently Google 3 started out by indexing just the ti- 
tles [8]. 

Today it is common for the major engines to index not 
only all the text, but also the title of each page. Smaller 
services such as research projects or intranet search en- 
gines may opt for reduced storage and index less. What is 
less common is the indexing of HTML META tags con- 
taining author-supplied keywords and descriptions. Some 
search engines will index the text of these fields, but others 
do not [32], citing problems with search engine spamming 
(that is, some authors will place keywords and text that are 
not relevant to the current page but instead are designed to 
draw traffic for popular search terms). 

Likewise, while indexers typically include anchor text 
(text within and/or around a hypertext link) as some of 
the terms that represent the page on which they are found, 
most do not use them as terms to describe the page referenced. 
One significant exception is Google, which does 
index anchor text. By doing so, Google is able to present 
target pages to the user that have not been crawled, or 
have no text, or are redirected to another page. One drawback, 
however, is that this text might not in fact be related 
to the target page. A recently publicized example was the 
query "more evil than satan himself" which, at least for 
a while, returned Microsoft as the highest ranked answer 
from Google [31]. 

1 http://www.altavista.com/ 
2 http://www.lycos.com/ 
3 http://www.google.com/ 

So, for search engine designers, we want to address the 
questions of how well anchor text, title text, and META 
tag description text represent the target page's text. Even 
when title and descriptions are indexed, they may need to 
be weighted differently from terms appearing in the text 
of a page. Our goal is to provide some evidence that may 
be used in making decisions about whether to include such 
text (in addition to or instead of the target text content) in 
the indexing process. 

2.2 Search ranking systems 

Traditionally, search engines have used text analysis to 
find pages relevant to a query. Today, however, many 
search engines incorporate additional factors of user popu- 
larity (based on actual user traffic), link popularity (that is, 
how many other pages link to the page), and various forms 
of page status calculations. Both link popularity and sta- 
tus calculations depend, at least in part, on the assumption 
that page authors do not link to random pages. Presum- 
ably, link authors want to direct their readers to pages that 
will be of interest or are relevant to the topic on the current 
page. The link analysis approaches used by Clever 4 [20] 
and others [6, 8, 15] depend on having a set of intercon- 
nected pages that are both relevant to the topic of interest 
and richly interconnected in order to calculate page status. 
Additionally, some [10] use anchor text to help rank rele- 
vance of a query to communities discovered from the anal- 
ysis. 

LASER [7] demonstrates a different use of linkage in- 
formation to rank pages. It computes the textual rel- 
evance, and then propagates that relevance backwards 
along links that point to the relevant pages. The goal is 
to enable the engine to find pages that are good starting 
points for automated crawling, even if those pages don't 
rank highly based on text alone. 

Our analysis may help to explain the utility of anchor 
text usage, as well as show how likely neighboring pages 
are to be on the same topic. 

2.3 Meta-search engines 

Meta-search engines (e.g. MetaCrawler 5 [30], SavvySearch 6 [18]) 
are search services that do not search an index 
of their own, but instead collect and compile the results 
of searching other engines. While these services may 
do nothing more than present the results they obtained for 
the client, they may want to attempt to rank the results 
or perform additional processing. Grouper [33], for example, 
performs result clustering. While Inquirus [22] 
fetches all documents for analysis on full-text, a simpler 
version (perhaps with little available bandwidth) might 
decide to fetch only the most likely pages for further analysis. 
In this case, the meta-engine has only the information 
provided by the original search engines (usually just 
URL, title, and description), and the quality of these page 
descriptors is thus quite important to a post-hoc textual 
ranking or clustering of the pages. 

4 http://www.almaden.ibm.com/cs/k53/clever.html 
5 http://www.metacrawler.com/ 
6 http://www.savvysearch.com/ 

2.4 Focused crawlers 

Focused crawlers are web crawlers that follow links that 
are expected to be relevant to the client's interest (e.g. [11, 
4, 26, 24] and the query similarity crawler in [12]). They 
may use the results of a search engine as a starting point, or 
they may crawl the web from their own dataset. In either 
case, they assume that it is possible to find highly relevant 
pages using local search starting with other relevant pages. 
Dean and Henzinger [16] use a similar approach to find 
related pages. 

Since focused crawlers may use the content of the cur- 
rent page, or anchor text to determine whether to expand 
the links on a page, our examination of nearby page rele- 
vance and anchor text relevance may be useful. 

2.5 Intelligent Browsing Agents 

There have been a variety of agents proposed to help peo- 
ple browse the web. Many of those that are content-based 
depend on the contents of a page and/or the text contained 
in or around anchors to help determine what to suggest to 
the user (e.g. [19, 27, 24, 26, 3]) or to prefetch links for 
the user (e.g. [24, 13, 28]). 

By comparing the text of neighboring pages, we can 
estimate the relevance for pages neighboring the current 
one. We also find out how well anchor text describes the 
targeted page. 

3 Experimental Method 

3.1 Data Set 

3.1.1 Initial Data Set 

Ideally, when characterizing the pages of the WWW, one 
would choose a random set of pages selected across the 
Web. Unfortunately, while the Web has been estimated to 
contain hundreds of millions of pages [23], no one entity 
has a complete enumeration. Even the major search engines, 
with a few hundred million pages in their databases, 
only know of a fraction of the web, and the pages retained 
in those datasets are biased samples of the Web. As a result, 
the unbiased selection of a random subset of the Web 
is an open question [5]. 

Accordingly, the pages used as starting points in 
this paper were selected at random from a subset of the 
web. We randomly selected 100,000 pages out of the approximately 
3 million pages that our local research search 
engine (DiscoWeb [15]) had crawled by early December 
1999. The pages in the DiscoWeb dataset at that time were 
generated primarily from the results of inquiries made to 
the major search engines (such as HotBot 7 and AltaVista) 
plus pages that were in the neighborhood of those results 
(i.e. direct ancestors or descendants of pages in those re- 
sults). Thus, selecting pages from this dataset will bias our 
sample toward pages in the neighborhood of high-ranking 
English-language pages (that is, pages near other pages 
that have scored highly on some query to a search engine). 

3.1.2 Remaining Data Set 

From the initial data set, we randomly selected one outgo- 
ing link per page and retrieved those pages. We also ran- 
domly reselected a different outgoing link per page (where 
possible) and fetched those pages as well. The latter set 
was used for testing anchor text relevance to sibling pages 
and to measure similarity between sibling pages. 
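
The link-sampling step above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `pick_two_children` and the choice to deduplicate target URLs before sampling are assumptions made for the example.

```python
import random


def pick_two_children(outlinks, rng=random):
    """Pick up to two distinct outgoing links from one parent page.

    Returns (first, second); either may be None when the page has too
    few distinct links. Deduplicating targets first is an assumption.
    """
    unique = sorted(set(outlinks))  # sorted so a seeded RNG is reproducible
    if not unique:
        return None, None
    first = rng.choice(unique)
    # "Reselect a different outgoing link (where possible)".
    remaining = [url for url in unique if url != first]
    second = rng.choice(remaining) if remaining else None
    return first, second


if __name__ == "__main__":
    links = ["http://example.org/a", "http://example.org/b", "http://example.org/a"]
    print(pick_two_children(links, random.Random(0)))
```

The first selection forms the parent-child pair; the second, when it exists, supplies the sibling used for the sibling and anchor-text comparisons.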

3.1.3 Retrieval, Parsing, and Textual Extraction 

The pages were retrieved using the Perl LWP::UserAgent 
library, and were parsed with the Perl HTML::TreeBuilder 
library. Text extraction from the HTML pages was performed 
using custom code that down-cased all terms and 
dropped all punctuation so that all terms are made strictly 
of alphanumerics. Content text of the page does not include 
title or META tag descriptions, but does include alt 
text for images. URLs were parsed and extracted using 
the Perl URI::URL library plus custom code to standardize 
the URL format (down-casing host, dropping #, etc.) to 
maximize matching of equivalent URLs. The title (when 
available), description (when available), and non-HTML 
body text were recorded, along with anchor text and target 
URLs. The anchor text included the text within the link 
itself (i.e. between the <a> and </a>), as well as sur- 
rounding text (up to 20 terms but never spanning another 
link). The basic representation of each textual item was 
bag-of-words with term frequency. 
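
The term extraction and URL standardization described above might look roughly like the following Python sketch. It is not the original Perl code. The names `bag_of_words` and `normalize_url` are hypothetical, and several details are assumptions: punctuation is deleted rather than treated as a separator, only ASCII alphanumerics are kept, and an empty path is rewritten to `/`.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def bag_of_words(text):
    """Down-case the text, drop punctuation, and count term frequencies."""
    # Assumption: punctuation is deleted, so "don't" becomes "dont".
    cleaned = re.sub(r"[^a-z0-9\s]+", "", text.lower())
    return Counter(cleaned.split())


def normalize_url(url):
    """Down-case the host and drop the #fragment so equivalent URLs match."""
    parts = urlsplit(url)
    path = parts.path or "/"  # assumption: a bare host is the same as "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


if __name__ == "__main__":
    print(bag_of_words("Click here, click HERE!"))
    print(normalize_url("http://WWW.Example.COM/A/b.html#top"))
```

Only the host is down-cased; the path keeps its case because paths are case-sensitive on many servers.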

3.2 Textual Similarity Calculations 

To perform the textual analysis, we used three straightforward 
calculations, which we describe in this section. 
While each of the measures can be applied to any pair of 
documents, we will sometimes use the term "query" when 
we refer to the "document" composed of the words in the 
source document (e.g. the title words, or description, or 
anchor text, or in general the first document of a pair). 

7 http://www.hotbot.com/ 

Note that all measures have the following two proper- 
ties: they produce scores in the range [0..1]; and identical 
documents generate a score of 1 while documents having 
no words in common generate a score of 0. 

3.2.1 TFIDF cosine similarity 

The first calculation selected was TFIDF, for its 
widespread use and long history in information re- 
trieval. Note that the IDF values are calculated from the 
documents in the combined retrieved sample, not over 
the entire Web. The specific formulas used were: 



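$$\mathrm{TFIDF}(w_i, P) = \frac{\mathrm{TF}(w_i, P) \cdot \mathrm{IDF}(w_i)}{\sqrt{\sum_{\text{all } w} \left(\mathrm{TF}(w, P) \cdot \mathrm{IDF}(w)\right)^2}}$$

where

$$\mathrm{TF}(w, P) = \lg(\text{number of times } w \text{ appears in } P + 1)$$

and

$$\mathrm{IDF}(w) = \lg\left(\frac{\text{number of docs} + 1}{\text{number of docs with term } w}\right)$$

So each document has the value 0 or a TFIDF value for 
each term, which are then normalized (divided by the sum 
of the values) so that the values of the terms in a document 
sum to 1. For document similarity, we use the cosine measure: 

$$\mathrm{TFIDF\text{-}Cos}(Q, P) = \frac{\sum_{\text{all } w} \mathrm{TFIDF}(w, Q) \cdot \mathrm{TFIDF}(w, P)}{\sqrt{\sum_{\text{all } w} \mathrm{TFIDF}(w, Q)^2 \cdot \sum_{\text{all } w} \mathrm{TFIDF}(w, P)^2}}$$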
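
As an illustration of this measure, the following Python sketch computes TFIDF cosine similarity over bags of words. It is a reading of the formulas in this section, not the original implementation. The function names are hypothetical, and "lg" is taken to be the base-2 logarithm, which is an assumption; the cosine score does not depend on the base, nor on how each vector is scaled.

```python
import math
from collections import Counter


def idf_table(docs):
    """IDF(w) = lg((number of docs + 1) / number of docs containing w)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    n = len(docs)
    return {w: math.log2((n + 1) / df) for w, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """TF(w, P) * IDF(w) for each term, divided by the vector's Euclidean norm."""
    raw = {w: math.log2(count + 1) * idf[w] for w, count in doc.items() if w in idf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else {}


def tfidf_cos(q, p, idf):
    """Cosine of the angle between the TFIDF vectors of Q and P."""
    vq, vp = tfidf_vector(q, idf), tfidf_vector(p, idf)
    dot = sum(v * vp.get(w, 0.0) for w, v in vq.items())
    denom = math.sqrt(sum(v * v for v in vq.values()) * sum(v * v for v in vp.values()))
    return dot / denom if denom else 0.0


if __name__ == "__main__":
    # Tiny made-up corpus, for illustration only.
    corpus = [Counter("web page links".split()),
              Counter("web search engine".split()),
              Counter("garden tools".split())]
    idf = idf_table(corpus)
    print(tfidf_cos(corpus[0], corpus[1], idf))
```

The IDF table is built from the sampled documents themselves, matching the note above that IDF values come from the combined retrieved sample rather than the entire Web.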

3.2.2 Query term probability 

The second measure is designed to measure the likelihood 
of a term in the query being present in the target document. 
It is simply the sum of the fractions of the query corre- 
sponding to query terms that are also present in the target 
document: 

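$$\mathrm{Fract}(w, Q) = \frac{\text{number of times } w \text{ appears in } Q}{\text{number of terms in } Q}$$

$$\mathrm{Prob}(Q, P) = \sum_{\text{all } w} \begin{cases} \mathrm{Fract}(w, Q) & \text{if } w \in P \\ 0 & \text{otherwise} \end{cases}$$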
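
A small Python sketch of the query term probability follows, assuming both documents are bags of words (term to count). The function names `fract` and `prob` mirror the notation of this section but are otherwise hypothetical.

```python
from collections import Counter


def fract(w, doc):
    """Fraction of the terms in doc that are occurrences of w."""
    total = sum(doc.values())
    return doc.get(w, 0) / total if total else 0.0


def prob(q, p):
    """Sum of Fract(w, Q) over the query terms w that also occur in P."""
    return sum(fract(w, q) for w in q if p.get(w, 0) > 0)


if __name__ == "__main__":
    query = Counter(["click", "here", "click"])
    page = Counter(["click", "now"])
    print(prob(query, page))  # "click" makes up two of the three query terms
```

Unlike the other two measures, this one is asymmetric: it asks how much of the query is found in the target, not the reverse.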



3.2.3 Query-document overlap 

The third measure used was chosen to measure the amount 
of overlap of the two documents, after being normalized 
for differences in length. Thus, to calculate this measure 
we sum over all terms the smaller of the representative 
fractions of each document: 

$$\mathrm{Overlap}(Q, P) = \sum_{\text{all } w} \min(\mathrm{Fract}(w, P), \mathrm{Fract}(w, Q))$$
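
The overlap measure can be sketched in Python as below, again assuming bag-of-words inputs; the function name `overlap` is hypothetical. Terms that occur in only one document contribute a minimum of zero, so iterating over the terms of one document is enough.

```python
from collections import Counter


def overlap(q, p):
    """Sum over all terms of the smaller of the two documents' term fractions."""
    total_q, total_p = sum(q.values()), sum(p.values())
    if not total_q or not total_p:
        return 0.0
    return sum(min(count / total_q, p.get(w, 0) / total_p) for w, count in q.items())


if __name__ == "__main__":
    a = Counter("a a b".split())
    b = Counter("a b b c".split())
    # min(2/3, 1/4) for "a" plus min(1/3, 2/4) for "b" gives 7/12
    print(overlap(a, b))
```

Because each document's fractions sum to 1, the score is symmetric and reaches 1 only when the two term distributions are identical.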

Figure 1: Representation of the twenty most common top-level 
domain names in our combined dataset, sorted by frequency. 
The top ten domains are .com, .edu, .org, .net, .uk, .de, .us, .ca, 
.gov, and .au. 



3.3 Experiments Performed 

The primary experiments performed include measuring 
the textual similarity: 

• of the title to its page, and of the description to its 
page 

• of a page and one of its children 

• of a page and a random page 

• of two pages with the same direct ancestor (i.e. be- 
tween siblings) and with respect to the distance in the 
parent document between referring URLs 

• of anchor text and the page to which it points 

• of anchor text and a random page 

• of anchor text and a page different from the one to 
which it points (but still linked from the parent page) 

Additionally, we measured lengths of titles, descrip- 
tions (text provided in the description META tag of the 
page), anchor texts, and page textual contents. We also ex- 
amined how often links between pages were in the same 
domain, and if so, the same host, same directory, etc. 

We also performed experiments with stop word elimination 
and Porter term stemming [29], but for space reasons 
those results are omitted below (they are similar, and are 
included in a longer technical report [14]). No other feature 
selection was used (i.e., all terms were included). 

4 Experimental Results 

4.1 General characteristics 

For a baseline, we first consider characteristics of the 
overall dataset. Out of the initial 100,000 URLs se- 
lected, 89,891 were retrievable. An additional 111,107 
unique URLs were retrievable by randomly fetching 
two child links from each page of the initial set (whenever 
possible). The top five represented hosts were: 
www.geocities.com (561 URLs), www.webring.com 
(419 URLs), www.amazon.com (303 URLs), members.aol.com 
(287 URLs), and www.tripod.com (196 
URLs). Combined, they represent less than 1% of the 
URLs used. Figure 1 shows the most frequent top-level 
domains in our data; close to half of the URLs are from 
.com, and another 26.8% of the URLs came from .edu, 
.org, and .net. Approximately 18% of the URLs represent 
top-level home pages (i.e. URLs with a path component 
of just /). 

Figure 2: Distribution of content lengths of web pages. 

Figure 3: Distributions of URL match lengths are similar for 
parent-child1, parent-child2, and child1-child2 (siblings). 

With respect to content length, the sample distributions 
used for source and target pages are similar, so we present 
one distribution (pages from the initial dataset containing 
titles), shown in figure 2. Thus it can be seen that almost 
half of the web pages contain 250 words or less. 

For pairings of pages with links between them, the do- 
main name matched 55.67% of the time. For pairings of 
siblings, the percentage was 46.32%. For random pairings 
of pages, the domain name matched 0.003% of the time. 

We also measured the number of segments that matched 
between URLs. A score of 1 means that the host name 
and port (more strict than just domain name matching) 
matched. For each point above 1, an additional path seg- 
ment matched (i.e. top-level directory match would get 
2; an additional subdirectory would get 3, and so on). 
The distributions of these segment match lengths for connected 
pages are shown in figure 3. 
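
One way to compute such a segment match score is sketched below in Python. The function name `url_match_length` is hypothetical, and two details are assumptions rather than statements from the text: a missing port is treated as the scheme's default port, and the final path component (the file name) is counted as a segment like any directory.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def _host_port(parts):
    """Host name and port, filling in the scheme's default port (assumption)."""
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.hostname or "").lower(), port


def url_match_length(url_a, url_b):
    """0 if host or port differ; otherwise 1 plus the number of leading
    path segments the two URLs share."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if _host_port(a) != _host_port(b):
        return 0
    score = 1
    segments_a = [s for s in a.path.split("/") if s]
    segments_b = [s for s in b.path.split("/") if s]
    for seg_a, seg_b in zip(segments_a, segments_b):
        if seg_a != seg_b:
            break
        score += 1
    return score


if __name__ == "__main__":
    print(url_match_length("http://a.example/x/y.html", "http://a.example/x/z.html"))
```

With this reading, two pages in the same top-level directory of the same host score 2, matching the example given in the text.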

Figure 4 shows similarities for the author-supplied 
same-page descriptors (titles and description META tag 
contents). Descriptions show poorer performance than ti- 
tles for both TFIDF and term probabilities, suggesting that 
authors often include terms not present in the page being 
described. With longer text in descriptions than in titles, 
we find that descriptions have higher overlap with the con- 
tent, but not as much as the increased length of the descrip- 
tion would suggest. 



Figure 4: Similarity scores for title, description, and ti- 
tle+description as compared to the text on the same page. 
Comparisons between scores in a graph are significant 
(p < .01). 

4.2 Page to page characteristics 

Figure 5 presents the similarity scores of the current page 
to the linked page, to random pages, between sibling 
pages, and to subsets of the linked pages. All three metrics 
demonstrate that random page texts have almost nothing 
in common, linked page texts have more in common when 
the links are between pages of the same domain, and that 
sibling pages are more similar than linked pages of differ- 
ent domains. 

In figure 6, we plot sibling page similarity scores as a 
function of distance between referring URLs in the parent 
page. We find that in general, the closer two URLs are, 
the more likely they are to share the same terms. This is 
most strikingly found for TFIDF-Cosine similarity, but it 
is present in all three metrics. This is corroborated by oth- 
ers [16, 9] who have observed that links to pages on sim- 
ilar topics are often clustered together on the parent page. 
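
The strength of such a trend is commonly summarized by a correlation coefficient between link distance and similarity score. The sketch below computes a Pearson coefficient in Python; that Pearson's r is the coefficient intended is an assumption, and the numbers in the example are toy values invented for illustration, not measurements from our data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy values only: distance between two links in the parent page,
    # and a made-up mean similarity score at that distance.
    distances = [1, 2, 4, 8, 16]
    similarities = [0.30, 0.27, 0.25, 0.20, 0.15]
    print(pearson_r(distances, similarities))  # negative: similarity falls with distance
```

A value near -1 indicates that similarity falls almost linearly as the referring links move apart.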

4.3 Anchor to page characteristics 

Anchor text, by itself, has a mean length of 2.69 terms 
(slightly lower than the average reported by Amitay [1]). 
In comparison, titles have a mean length of 5.27 terms. 
However, we can also consider using text before or after 
the anchor text, and when we consider using up to 20 terms 
before and 20 terms after, we get a mean of 11.02 terms. 

Figure 7 shows that anchor text scores much higher for 
non-random pages for each of the metrics. Even the similarity 
of anchor text to pages that are siblings of the targeted 
page gets scores at least an order of magnitude better 
than random. There are also some conflicting results: in 
7a and 7c, the highest scoring performance goes to anchor 
text to linked pages of a different domain than the source 
page, but this is not the case for term probabilities in 7b. 

Figure 5: Textual similarity for linked pages, random pages, sibling 
pages, linked pages in the same domain, and linked pages in 
different domains. Comparisons between scores in a graph are 
significant (p < .01). 

The mean TFIDF scores (figure 8a) for anchor text plus 
varying amounts of surrounding text are relatively consis- 
tent (and the distributions, not shown, for each version are 
almost identical). While there is some improvement as 
more text is added, it is very small. The term probabilities 
(figure 8b), on the other hand, show a decline when addi- 
tional words are used. Apparently the additional text pro- 
vided has a much lower likelihood of being present in the 
target page. For example, the additional terms (.76 terms, 
on average) when allowing one additional word on each 
side of the anchor, have only a 51% chance of being in 
the target page (as compared to the 65% chance for anchor 
text terms). Unlike the others, overlap scores in figure 8c 
show some improvement as additional words are used. 

While potentially confusing, these results are compatible 
with those reported by Chakrabarti et al. [10]. They 
found that including fifty bytes of text around the anchor 
would catch most references of the term "Yahoo" for a 
large dataset of links to the Yahoo home page 8. Our interpretation 
is that while additional text does increase the 
chance of getting the important term(s), it also tends to 
catch more unimportant terms, lowering the overall term 
probability scores (as seen in 8b), with the two effects 
almost cancelling each other out in 8a. While these results 
may not be particularly encouraging, text surrounding the 
anchor is occasionally quite useful (especially for link 
text made of low-content terms like "click here"). 

Figure 6: Plots of similarity scores between sibling pages as a 
function of distance between referring URLs in parent page for 
TFIDF, Term Probability, and Overlap, respectively. Each has a 
high negative correlation coefficient (r < -.79). 

5 Conclusions 

Text on the Web is not the same as text off the Web. Ami- 
tay [1] examines the linguistic choices that web authors 
use in comparison to non-hypertext documents. Without 
going into the same detailed analysis, we did find some 
similar characteristics of web pages. The bigrams "click 
here" and "home page" were the 11th- and 13th-most popular, 
and certainly not typical bigrams of off-Web text. 
Interestingly, "all rights" and "rights reserved" were the 
sixth- and seventh-most popular, perhaps reflecting the in- 
creasing commercialization of the Web. 

This paper provides empirical evidence of topical local- 
ity of pages mirroring spatial locality in the Web — that is, 
WWW pages are typically linked to other pages with sim- 
ilar textual content. We found that pages are significantly 
more likely to be related topically to pages to which they 
are linked, as opposed to other pages selected at random, 
or other nearby pages. Furthermore, we found evidence of 
topical locality within pages, in that sibling pages are more 
similar when the links from the parent are closer together. 

Figure 7: Performance of anchor text only to linked text, linked 
text in a different domain, linked text in the same domain, text of 
a sibling of the link, and the text of random pages. Comparisons 
between scores in a graph are significant (p < .01). 

We also found that anchor text is most similar to the 
page it references, followed by siblings of that page, and 
least similar to random pages, and that the differences in 
scores are statistically significant (p < .01) and often large 
(an order of magnitude or more). This suggests that an- 
chor text may be useful in discriminating among unseen 
child pages. We note that anchor text terms can be found 
in the target page close to as often as the title terms on that 
target page, but that the titles also have better overlap and 
TFIDF cosine similarity scores. We have pointed out that 
on average the inclusion of text around the anchor does not 
particularly improve similarity measures (but neither does 
it hurt). Finally, we have shown that titles, descriptions, 
and anchor text all have relatively high mean term proba- 
bilities (and high mean TFIDF scores), implying that these 
page proxies represent at least part of the target page well. 

Pitkow and Pirolli [28] have observed that "hyperlinks, 
when employed in a non-random format, provide seman- 
tic linkages between objects, much in the same manner 
that citations link documents to other related documents." 
We have demonstrated that this semantic linkage, as ap- 
proximated by textual similarity, is measurably present 
in the Web, thus providing the underpinnings for various 
web systems, including search engines, focused crawlers, 
linkage analyzers, and intelligent web agents. 